[00:00:00] Thanks so much.

Hi everyone. My name is Chris Freeland and I'm a librarian at the Internet Archive. So here we are today at our final session of the Library as Laboratory series. Over the past ten weeks, if you can believe it, we've brought together some of the world's leading scholars in the digital humanities and data-intensive sciences to talk about their projects and how they're using the Internet Archive in the course of their research. We started off our series by asking the bold question: what can you do with billions of archived web pages? And two weeks ago we rounded out the formal part of our series by asking: what can you do with 60 million digitized pages of scientific literature? In between, we've heard from bibliographers, authors, educators, archivists, and data scientists about how they're using materials, services, and infrastructure from the Internet Archive. The recordings for all of those sessions are now available online, and I see that Duncan, who's working behind the scenes today along with Caitlin (hello, both of you, and thank you), has shared that out into the chat.

But we're going to do something a little different today. When we started planning this series, we had inquiries from a number of scientists who wanted to do shorter presentations, something more like a status update or an overview of their research, not a 45-minute lecture with a long Q&A. So we put out a call for lightning talks, and today we're going to bring you a variety of short talks and videos from researchers working on topics as varied as understanding the effects of emotional cues by the chairs of the US Federal Reserve on financial markets, to a DIY book-scanning robot.

So here's the game plan for today. Live transcripts and captions are available; use the live transcript feature of Zoom to turn those on. You can also copy from within the chat by just mousing over anything you're interested in, like the links that we're going to share. We're also going to capture all of the links in the chat and make those available, including the video; all the resources will be available. So if you're trying to catch something, if you want to remember something or copy a link that we've shared, rest assured it will make it into the email that we're going to send you tomorrow, along with the video for today's session and the other resources we've shared. As you see, the chat is open; please do be respectful and keep the comments on topic, and use the Q&A feature to submit questions for our panelists. We'll have time for probably one question per speaker, so do please submit questions for us to gather from. And a final thing I do want to mention, about time:
I anticipate that we're definitely going to run over an hour today, probably more towards 90 minutes. So for those of you who need to depart at the top of the hour, rest assured we're going to record all of this, and it'll be made available to everyone. For now, I can see people are already using the chat. Please do say hello! Let us know who you are and where you're joining in from today.

I was actually just looking back over my planning notes, and we started talking about what became this Library as Laboratory series last September. Like many ideas at the Internet Archive, it burst forth with enthusiasm and gusto in a meeting with Brewster Kahle, the founder and digital librarian of the Internet Archive. So I'd like to welcome Brewster to the screen. Brewster, I'd like to have you share a bit of your thinking: why was organizing this digital humanities expo a priority for you and for the Internet Archive?

[00:03:48] Oh, thank you, Chris. This is really fulfilling the dream of the Internet Archive. We started by trying to archive the Internet, and how do you go and do that, and then expanded to archiving other things. But it wasn't just to make it so that you can go and find old web pages. It was to try to get a bigger view: can we help make it so that lots of new and different things can happen without you having to build your own collection yourself? Build a library where you can just go and have your new idea, and all of the materials needed for your research are on the shelves, but now digital shelves. That was the impetus for the Internet Archive in general. And so this idea of having people be able to use these collections at scale has been so important. But actually it's really pretty hard: the collections are, I think, in pretty good shape, but they're just huge, and hard to find, and it's hard to figure out how to use them. Still, we have seen fabulous presenters, including the ones here today.

But with the urgency of misinformation, and the feeling that our information ecosystem is out of our control, I think we really need people with a macroscope. This is Jesse Ausubel's line. He said we got really far with the microscope towards understanding: humans and scientists and science got really far along. Now we need a macroscope: can we get a bigger picture of what's going on? Can we use web-scale, TV-scale, book-scale types of information to try to help us understand our world? And I'm jazzed that many of the key players have come forward to speak about what they're doing. They're so inspiring, inspiring to me, and inspiring our staff,
but I think others too, to go and say: yes, I can use these sorts of collections at scale to do different things.

So what's the value to the Internet Archive of this, from a completely selfish point of view? It helps drive us forward to make more useful collections and tools. And for that we're going to need your feedback. Tell us what the Internet Archive can do to be a better library. What are the materials, the tools, the structures or platforms that you would like to see? Is this Library as Laboratory series useful to you? We've seen a large number of people join in on these sessions. So please, send feedback; that's absolutely essential. And thank you, Chris, for making all of this come about with a tremendous group of researchers. Looking forward to today.

[00:06:33] Thanks for that, Brewster. Always good to hear from you and hear what you're thinking about. So, as we jump in here today, I want to let all of you in the audience know what to expect. We have three segments of talks, and each segment will have three to four talks, that's talks and videos. We'll move from talk to talk quickly, and we'll wrap up each segment with a quick round of questions and answers. So if you have questions, again, drop them into the Q&A; as I said, we'll probably have time for one question aloud per speaker. But our speakers are going to be hanging around, and if you drop off additional questions, they might be interested in engaging with you further and answering some of those from the Q&A. So as you have questions, please do ask them using the Q&A feature. I'll also mention that Duncan dropped a link into the chat with the agenda, so that you can follow along with the order that we're going in today.

So let's get started. Up first today is Kate Miltner from the University of Edinburgh. Kate's going to tell us about the forgotten histories of the mid-century coding boot camp. Over to you, Kate.

[00:07:56] Hi everyone, thanks so much for joining. I'm just going to share my screen, and we will get this party started.

Okay. So, as Chris mentioned, I'm Kate Miltner. I am a Marie Curie postdoctoral fellow at the University of Edinburgh, and I'm just really grateful to Chris and the Internet Archive for having me to talk about some of the amazing artifacts that I've used in my historical research on electronic data processing schools, which are what I consider to be the mid-century predecessor of the contemporary coding boot camp.

So, in the past decade, news stories like these have become increasingly familiar. I'm sure most, if not all, of you have seen or read an article that talks about learning to code
and what it can accomplish; maybe you even wrote one of them. As part of my PhD thesis, I read over 200 articles published over a 10-year period that talked about learning to code. In reading all of these articles, it became clear that coding has been positioned as a solution to a variety of interconnected social issues, including concerns about AI, gender and racial bias in the tech sector, economic inequality, and skills gaps that have supposedly left millions of well-paid positions unfilled due to a lack of appropriate training.

In response to this discourse, an entire industry of coding boot camps has developed in the US and across the world. Coding boot camps are short-term intensive courses that aim to make professional technologists out of technical novices. The topics of these programs can vary: some are focused on data science, some on user experience, some on software engineering; there are even programs for digital marketing and product management. The length of time can vary too. Most are around three to four months, but some last as long as two years. But independent of the topic and the timeline, the promise that these programs make is usually the same: stick with us, and the tech career of your dreams will be right within reach. It's a pretty compelling promise. Across the US and Canada alone, the coding boot camp industry trained almost 25,000 people a year in 2020, pulling in almost 350 million dollars.

One of the claims made by coding boot camps is that they offer a novel solution for addressing some of the major problems within tech education. First, they're a lot shorter than a four-year college degree. Second, they're supposed to have a much lower sticker price. Third, they're supposed to be more accessible to groups that are traditionally excluded from the tech industry. And finally, they're more agile than a university, which is supposed to make them responsive to the needs of industry. With all of these factors combined, they're meant to be a new and ideal way to respond to the challenges of what some call the Fourth Industrial Revolution.

But as any tech scholar will tell you, the likelihood of something tech-related from the current moment being completely brand new is pretty low. So I looked for the historical roots of the coding boot camp, and what I was able to find was surprising even to me. As I was reading through some work in the history of computing, I came across Nathan Ensmenger's fantastic book about the software industry in the mid-twentieth century. In it, he wrote about the prevalence of electronic data processing schools, or EDP schools, in the 1960s and 1970s. EDP schools were vocational schools that offered short-term courses in computer programming, and much like coding boot camps,
these privately run schools aimed to train people for jobs in industry.

After reading about EDP schools, I started looking for some primary historical materials about them. Based on the sources available to me at the university where I did my PhD, I was able to find a few newspaper articles. I also found ads for one franchise of schools in particular, the Electronic Computer Programming Institute, or ECPI. In the ads especially, I began to see some similarities between the sales pitches of EDP schools and the sales pitches of coding boot camps, especially around how easy programming can be and the number of available jobs in the computing industry.

I was hoping to find some more materials about EDP schools in traditional archives like the Charles Babbage Institute at the University of Minnesota, but because most EDP schools were either regional, short-lived, or both, the finding aids for the archives suggested that there wasn't a lot of archival material to be found. But then, thanks to the junk mail collection of a computing pioneer named Ted Nelson, and the wonders of digitization, I stumbled upon something that I thought I'd never find: an original recruitment brochure from the ECPI from the late 1960s. The ECPI booklet was only 16 pages long, but it was a crucial piece of evidence in my historical research that allowed me to make important links between the past and present of computing education.

It was really remarkable how many similarities there were between how EDP schools and coding boot camps framed themselves and what they have to offer. First, both kinds of organizations framed a career in computing as accessible to anyone, and as a good way to guard against the threat of automation. EDP schools also focused on the wide availability of computing jobs, much like contemporary discussions of the skills gap. They also focused on computing as an inclusive career pathway for women and people of color. This was actually pretty remarkable for the 1960s: of the ten students showcased for successful placements, over half were women or people from minoritized ethnic groups. Finally, the ECPI booklet underscored the school's links with industry, much like coding boot camps do today.

Of course, the image that an organization presents to the world and its reality can be starkly different. So I then began to look into whether the image that ECPI presented connected with other parts of the historical record. To do this, I turned to another fantastic Internet Archive resource: their digitized collection of Computerworld. To have this as a searchable digital archive was really incredible, because it contained some remarkably rich material that I probably wouldn't have found otherwise.
One of the key directions that the Computerworld archive encouraged me to pursue in my research was the disconnect between the claims of inclusivity made about the computing industry at the time and the reality of the computing labor market. There was a ton of coverage in Computerworld about EDP schools in the 1960s and 1970s, and a lot of it highlighted the many issues with these organizations. A series of articles published from 1969 to 1970 showed how the claims of accessibility were not necessarily true. Despite a purported programmer shortage, these articles illustrated how Black programming trainees found it next to impossible to get hired for programming jobs. This was a very different story than the one told by the ECPI booklet.

What the Computerworld archive also highlighted was how problematic so many of these schools were. In Nathan Ensmenger's book, he discussed how many companies had employed a "no EDP school graduates" policy due to the variable quality of these schools. Articles that I found in Computerworld give clear insight into how these organizations operated, and the dire situations that some of their students were left in.

So you may be asking yourself: okay, so what does the history of EDP schools in the 1960s and 1970s have to do with today? There's a reason that this history is worth paying attention to, and that's because it threatens to repeat itself in the current moment. The ECPI booklet showed us that there are a lot of similarities between how coding boot camps and EDP schools presented both themselves and the potential benefits of learning to code. But the Computerworld archive shows us that there may be some other similarities as well. The story of EDP schools didn't end well, either for many of the students or for the schools themselves. Some were shut down and liquidated by the mid-1970s, after a Federal investigation cracked down on fraudulent vocational schools. Others just went out of business because no one wanted to go anymore. Of course, the future of coding boot camps has yet to be written, but there are some warning signs that there might be more similarities between boot camps and EDP schools than we might hope. One of my goals for this project was to point out how the past lives on in the present, so that the mistakes of the past can hopefully be avoided.

If you've enjoyed this talk, I have a journal article coming out this fall in Information & Culture that talks about the commonalities between coding boot camps and EDP schools at length, so please do keep an eye out. Thanks so much for your attention, and if you'd like to get in touch, or put questions in the Q&A, please do.

[00:17:02] Thanks so much, Kate. I really appreciate that excellent overview. We have some questions that have come in;
I'm gathering those, and if others in the audience have questions, please do use the Q&A feature to drop those off. What we'll do now is move on to the next talk. Up next is Tom Gally from the University of Tokyo. Now, Tom is in Japan, and he will be watching this time-shifted. So this is a wave to Tom: in the future he will be watching this, in the past. I'm getting a little confused, but let's let Tom explain, in this video, some of his research about Japan. So, Caitlin, you want to take it away?

[00:17:47] Japan is now a modern country, not too different from many other places. But when it opened up to the world in the middle of the nineteenth century, Japan seemed exotic and mysterious to the first Western visitors. People who couldn't visit were even more curious about the country. So, to satisfy that curiosity, and to make some money, those visitors wrote books about Japan for the people back home. "Japan As They Saw It" is a collection of excerpts from those books. They show how those visitors, mostly American or British, described the country and its people: the cities, the countryside, the clothing, the religions, the theater and festivals, even the smells. They tell about riding in rickshaws, attending weddings, meeting prostitutes, experiencing earthquakes.

I got the idea to create "Japan As They Saw It" after I came across some of those old books at the Internet Archive. You see, I myself moved to Japan in 1983 from California, and I've lived here ever since. So it was fascinating for me to think about how my own first impressions of the country, and what I had told my friends and family back home about it, compared with what people had written about Japan a century earlier. I thought a collection of those earlier impressions might be interesting for others to read, too. So I first prepared a list of over 240 books about Japan at the Internet Archive, published between 1855 and 1912. I went through those books and chose passages that seemed interesting or amusing, and I put them all on this website. It's also available as an ebook. Each excerpt is linked to the original book at the Internet Archive, and, like those books at the Archive, "Japan As They Saw It" is free for anyone in the world to read. Thank you once again, Internet Archive, for making this possible.

[00:20:10] And thank you, Tom, for putting that video together. We've already reached out to Tom and asked if he would create other videos for other parts of our collection, because I thought that was really great. Also, stay tuned:
Tom is going to close out our show with another video, on forgotten novels of the nineteenth century, which is really fun research. But up next we have one of my favorite people on the planet, a colleague who I've worked with for years in a variety of ways, and I'm really pleased to welcome him to the virtual stage to tell us about his research. That's Rod Page from the University of Glasgow, talking about the bibliography of life. Over to you, Rod.

[00:21:02] Okay, thank you very much for that lovely introduction, Chris. So I'm going to talk about the bibliography of life, and I guess what I should do first is define what I mean by that. People in the field of biodiversity and taxonomy have this dream of the bibliography of life: basically, access to every taxonomic paper published on every species ever described. To give you a sense of the scale: we think there are probably about 2 million species described on the planet today, and there are probably about 10 million in total, so there's lots to be discovered. So we're thinking in terms of hundreds of thousands, perhaps even a million, publications that describe these species.

Sometimes when we talk about this notion of a bibliography of life, some people get upset because it seems to focus on taxonomy. What's so special about taxonomy? What about all these major biomedical databases, such as PubMed, with lots of information on medicine and so on? To make the case that there's something special about taxonomy: taxonomy and biodiversity, in many ways, are not big data, they're long data, with lots of long tails. What you can see in the slide here on the right is a summary of the size of Wikipedia pages for different mammal species. Some mammals, for example lions and mice, charismatic animals or medically important animals, have really large Wikipedia entries. Down at the bottom of this chart you can see the rest: literally thousands of mammal pages on Wikipedia that are very, very small. So for many species on the planet, almost all that we know, in terms of their ecology, their morphology, what they do, where they are, will come from the taxonomic literature.

The taxonomic literature itself is also very skewed. We have some very large, prominent journals, such as Zootaxa, that publish tens of thousands of new species descriptions. But there is also a very, very long tail, in many cases of very small journals that are often quite niche in terms of their taxonomic focus or geographic focus. And again, these will be a reservoir of lots of information about these species.
So again, if we're going to get a nice comprehensive bibliography of life, we're going to have to go hunting for those. Now, those of you who, like myself and Chris, have been around for a while in this area might be thinking: hang on a second, this sounds familiar. What about the Biodiversity Heritage Library, BHL? Isn't this what they're doing? We had a really interesting presentation about them a couple of weeks ago. Well, there is some overlap, but BHL suffers from what I'm going to call the Mickey Mouse gap, which is that the impact of copyright had a huge dampening effect on BHL's coverage. What I'm trying to capture in this diagram, in blue, is, for each decade, how many publications are out there describing species of animals (this is a lower bound), and in red are the articles in BHL. You can see that the coverage actually dips dramatically after about 1923, when copyright in America kicks in, and it only starts to increase a bit after that. So there's this big area of blue, and that's the stuff that I'm after: all these descriptions of species, mostly in the twentieth century, that are not in BHL.

So one solution is to try and capture these articles and put them somewhere safe. Now, a lot of this material has actually been digitized and is already available on various websites. So I've spent some time trying to gather this together, and I've created almost a mini-BHL on the Internet Archive by collecting these publications. They're full of beautiful photographs of species, with information on geography and maps and climate, and also the people who study these species. The idea is just to try and gather as many of these publications as possible.

I briefly talked about the long tail of taxonomic publications, all these small journals. Many of these journals are as endangered as the species, or as the taxonomists themselves. There are journals that simply vanish, like the one on the left, a recent Italian journal: it's gone. There are also journals that vanish but come back as zombies, taken over by bad actors. On the right is a journal that used to publish on butterfly taxonomy. The person who ran it eventually found it was too hard to maintain a taxonomic journal; somebody else came along, took over that domain name, and is now publishing anything but butterfly taxonomy. And this is where the Wayback Machine has been incredibly valuable: to try and retrieve these journals and their content, and also to discover the history of some of these journals which have been hijacked.
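For readers who want to try this kind of detective work themselves, the Wayback Machine exposes a public CDX API that lists every capture it holds for a URL. Here is a minimal sketch (not Rod's actual tooling; the journal domain below is hypothetical) of pulling the capture history of a vanished journal's website:

```python
# Minimal sketch of querying the Wayback Machine CDX API for the capture
# history of a journal website. The domain below is hypothetical.
import json
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"
query = (
    "?url=example-butterfly-journal.org/*"  # hypothetical vanished journal
    "&output=json"
    "&fl=timestamp,original,statuscode"
    "&filter=statuscode:200"                # successful captures only
    "&collapse=digest"                      # skip byte-identical recaptures
    "&limit=50"
)

with urllib.request.urlopen(CDX_API + query) as resp:
    rows = json.load(resp)

# The first row holds the field names; the rest are captures. Each capture
# can be replayed at https://web.archive.org/web/<timestamp>/<original>
for timestamp, original, status in rows[1:]:
    print(timestamp, original)
```

Comparing the earliest and latest captures of a domain like this is one way to spot the moment a hijacked journal's content changed hands.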
That's one part of the exercise: just getting the content. One thing that the Internet Archive probably isn't particularly great at is metadata. You can put quite a bit in, but for really detailed metadata I've turned to Wikidata. Wikidata is this extraordinary resource: it has an interesting interface, and there's an enormous community of people contributing. So, for example, if you have an article like this one, which is in English and Chinese, you can have titles in both languages. What really motivates my use of Wikidata is to try and connect all these things together.

So I guess what I'm ultimately aiming for is something a bit like this. There's a link, which I think is also going to be in the show notes. This is a little app that I built that essentially talks to Wikidata, and you can look at information, say, on a species. On the left you can get information on journals that publish papers on this species, and you can also then bounce to the taxonomists, the people actually doing the research. So the goal is to have this kind of network of linked things: images of species, journal articles, and the taxonomists themselves. And these little thumbnails you can see in these pictures here are coming from the Internet Archive; these are articles that are freely available, so people can read them and see pictures of the species.

So, where next for this project? I guess I'm very simple-minded, in a way. One part is just to try and get more content, and get as much of that content digitally preserved as possible; the Internet Archive is the obvious way to go down that path, along with using the Wayback Machine to try and retrieve some of these journals that have come and gone. The second stage is to link these things together, so we don't just have these digital artifacts sitting there, preserved in some way: we can actually dance between them and explore those connections. Thanks very much for your time.

[00:27:51] Thank you so much, Rod. If you would stay on screen, and if Kate would come back on, I would love to do a quick round of questions. I do have a question for each of you, and again, if there are additional questions, do please drop them in the Q&A. For Kate, the question is: how much stuff is in Ted Nelson's junk mail collection, and how did you search through it? It seems like finding a needle in a haystack.

[00:28:23] So actually, I have to thank Google for that one. I was just looking for information on EDP schools. Actually, I think it was just Google; I was trying to find information on ECPI, and because the guide had been OCR'd,
it showed up, and I just remember being like, what is this "junk mail collection"? What is even happening? And then I clicked through, and I was like, oh, this is a miracle! Holy cow, I can't believe there's this cool thing. And more than the actual booklet itself, there were other materials, including some news articles that had come with the packet, that said: oh, look how great learning to code, learning computer programming, is; you're going to make so much money, and all of that. I have not actually seen all of the junk mail collection, but I should definitely block off a day for it sometime.

I bet many of the researchers here in the audience have had the same experience. I know I did: doing a Google search for something in your discipline, and lo and behold, there it is, somewhere at the Internet Archive. I would never have thought to go look in the Ted Nelson junk mail collection for the thing that I needed. Thank you for that, Kate.

A question for you, Rod; we'll pick this one up from Martin Kalfatovic. Now, I know that you're not a copyright lawyer, but I bet you have some opinions on what Martin is asking, which is: current global copyright laws are terrible for science and for culture, but people go to jail, or worse, for violating them. Do you have any thoughts on how institutions can risk-manage that?

[00:30:16] Well, that's... that's a question. I guess one thing to point out: most of the content that I'm going after is, quote unquote, freely available, and some of it is explicitly freely licensed. So one model is to negotiate with the providers, to sign agreements to get the content, which I guess is the way BHL does it, which is all proper and mitigates the risk. I've sort of taken the approach that if it's not behind a paywall, and not expressly saying "do not copy this," then we can at least try and make copies of it so that it doesn't go away. And we've seen repeatedly that by the time we could have had these negotiations, some of these journals are just gone. So I take Martin's point. I guess that's an interesting question for the Internet Archive itself; it's an institution that presumably tries to mitigate the risk of dealing with this issue.

And Lila Bailey, the policy expert at the Internet Archive, has entered the chat and dropped off some thoughts, as did Brewster.
So we'll let that conversation continue on asynchronously. Kate, Rod, thank you so much for your presentations today. If you have additional questions for either one of them, drop them into the Q&A, and we'll pick up with our next speaker. I would like to welcome Michelle Alexopoulos to the screen. Michelle is going to tell you about her research looking at the expressions of the Federal Reserve Board chairs, and how those expressions affect financial markets. So over to you, Michelle.

[00:32:02] Thank you very much. I would like to thank the organizers for putting together such an interesting, eclectic group of papers. As Chris said, I want to talk about "more than words": basically, an analysis of the Federal Reserve Board chairs' communication during congressional testimony, based on a project that I'm doing with members of the Bank of Canada, specifically Xinfen Han, Oleksiy Kryvtsov, and Xu Zhang. The link to our newly released working paper is available there.

Now, in terms of the project's data sources, this wouldn't have happened unless we had access to a tremendous amount of resources. We were looking for audio and video as well as text from the Internet Archive. We've primarily used the TV archive content: these are videos from C-SPAN, CNBC, Bloomberg TV, and the House and Senate feeds. Where there are holes in that coverage, we've turned to the C-SPAN archives, YouTube videos, and textual sources that would give us transcript information. We analyzed all of this and then blended it with tick-by-tick stock data that came from sources like Wharton's WRDS and Refinitiv.

The overview of the project is pretty straightforward. What we wanted to do is examine the importance and the impact of US Federal Reserve communication. We measure aspects of the communication using AI, natural language processing, and other tools, and we test if and how the words chosen, the body language, and the tone of voice actually affect markets. When I talk about monetary policy and policy communications (and I understand we have a very large audience), I'm talking about things such as interest rate movements for the policy rate, forward guidance, quantitative easing, and bank balance sheet operations. These things have very large impacts on the welfare of businesses and individuals in our economy, not just in the United States but, because these markets are so large, around the world. Now, these policies are often fairly complex.
To be effective, central bank communications need to be accurate, they need to be clear, they need to be perceived as credible, and they need to reach their target audience. And it's true that when the central bank takes actions or releases communications, we're not all necessarily watching the actual release; we're also getting our information through media. So we have two different channels where we get this information. There's direct communication, such as the press conferences themselves, watching things like the testimony, or the press releases, such as the FOMC statements that were released last week. And there are the indirect types of communication that we often see captured by the Internet Archive's TV archive, which covers CNBC, Fox News, and things like Bloomberg TV. This is then amplified by traditional outlets such as The New York Times, and by Twitter.

Now, communication is obviously more than words, and this idea has been popular ever since the 1970s. I'm sure some of you have heard the statement that over 90% of communication comes from your body language and tone of voice. Although the Mehrabian studies probably don't get exactly the right numbers in terms of the quantities, the basic idea is pretty much accepted. You can see here a few pictures of Ben Bernanke and Janet Yellen, two of the Federal Reserve chairs: sometimes they look very relaxed, and sometimes they look a lot more perplexed or concerned. So we want to think about how that actually influences people.

And why might body language matter? Body language and tone of voice could matter through a direct channel: a trader, or we as individuals, may be watching directly, and then we take actions based on what we're hearing. We're either actively looking for something, or it may affect how we feel about something, our sentiment or confidence. For the indirect channels, because these get picked up and amplified through TV stations and other news outlets, what we basically get is: the Fed issues a communication, journalists or analysts watch it, and then they release information, and those of us receiving that information (it could be broken telephone, it may not be) take actions upon it ourselves. You can see here, for example, a quote from CNBC's Street Signs that talked about when Janet Yellen looked insecure, when she looked frustrated, when she looked angry, and even when Ben Bernanke's voice potentially got shaky at various times.
And then there's also the impact of algorithmic traders, where we now have an automated channel in which they may be taking a lot of these signals, analyzing them, and trading on their behalf. So our research tries to look at these different channels. We're not going to take a stand on which one is most important, but we want to take a look at Fed chair testimonies. This is a little different from some of the other papers that have looked at Fed communications, because we're focusing on these 32 testimonies, which we will expand over the summer, and we're looking at the three different dimensions of communication all at once. Now, what's interesting about these testimonies is that there's a prepared portion, and there are unscripted questions and answers, and people can obviously respond to those in different ways. We will also have the responses from the Senators and the Congressional Representatives.

Now, why testimonies, you might ask? Well, first of all, they're widely watched by Fed watchers and investors, and they're covered by a lot of news media. A testimony doesn't occur on the same day as an actual policy announcement, so you can focus a little bit more on disentangling just the communications component. And, of course, the high-quality TV footage and transcripts that are available from the archives have made this actually doable.

So, the structure of the semi-annual testimony: these run about two to three hours, and they come in pairs, one before the House and one before the Senate banking committee. You have the prepared remarks released before things begin, then some welcoming statements, the prepared remarks themselves, and the Q&A. What we can do is disentangle the different components. We have transcripts, which give us the actual words that are spoken; we have the audio component; and then we have the video component, which allows us to look at how facial expressions are formed.

So here are the three different methods. From the text, without getting into details, we're using a fine-tuned BERT model, a language model that is state of the art right now. From the voice, we're doing forced alignment with the transcripts and extracting pitch for each speaker with a tool called Praat. And for the facial expressions, we're using tools that would be used in psychology laboratories, FaceReader, for example, and Microsoft Azure, and we're matching these to facial action units, and then using mappings that have been validated with psychologists to figure out what kinds of negative emotions might be being expressed by the Federal Reserve chairs during this time.
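To give a flavor of the text channel: the team's own fine-tuned BERT model isn't shown in the talk, so here is a minimal sketch of the same idea using an off-the-shelf financial-sentiment model (ProsusAI/finbert from the Hugging Face hub) on made-up, testimony-style sentences, not real quotes:

```python
# Minimal sketch of scoring testimony text for financial sentiment.
# Uses the off-the-shelf ProsusAI/finbert model, NOT the fine-tuned BERT
# described in the talk; the sentences are illustrative, not real quotes.
from transformers import pipeline

classifier = pipeline("text-classification", model="ProsusAI/finbert")

sentences = [
    "Inflation expectations remain well anchored and the outlook is favorable.",
    "We are prepared to act if downside risks to growth materialize.",
]

for s in sentences:
    result = classifier(s)[0]  # e.g. {'label': 'positive', 'score': 0.93}
    print(f"{result['label']:>8}  {result['score']:.2f}  {s}")
```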
Now, all in all, we take all of this, and then we have to mesh it with financial market data, being very careful about the timing. We use the CNBC live coverage, and we watch the values of the S&P 500 in order to do some of the crosswalk on the timestamps. Then we basically create this giant database, and we use local projection regressions and difference-in-differences-type projections to look at the impact of these different channels of emotion on the outcomes, which would be, say, the change in the S&P 500, or what's happening with the VIX.

What we typically find (and this is just one of the examples) is that, yes, indeed, the S&P 500 and the VIX tend to respond in the ways you might expect. If the chair is relaxed and happy, not negative, you see the values of the stock market going up in the short run, and the value of the VIX, which is usually considered to be a fear index, going down. You also see that there are differences across topics, and, as you might expect, the ones the Fed has the most oversight or control over, such as monetary policy, have the largest impact on markets over the time period.

Now for some of the findings to date (I did promise Chris I would try to finish on time). What we did find is that these different dimensions don't all act at the same time in a highly correlated fashion; text, voice, and facial emotions do, though, tend to move financial markets. I showed you a couple of examples of the short-run responses, but the impact seems to grow stronger in the days following these testimonies. There are differences in the amount of soft information, and it can differ across topics, and we're also starting to see some patterns, which we're going to be verifying over the summer, as to whether some of these responses differ across the different Fed chairs. We also have the Congressional members' emotions, and sometimes they will significantly move markets as well. All right, so with that, I say thank you. Comments and questions are welcome; again, there's a link to the working paper, and you can reach out to me via email. I'll be happy to have a conversation. Thank you very much.

[00:40:51] Thank you, Michelle. We have gotten a couple of questions in; again, if you have additional questions for Michelle, please drop them off in the Q&A. What we'd like to do now is move on to our next speaker, and we'll pick up questions with both Sawood and Michelle in just a minute.
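For readers curious what a local projection looks like in practice, here is a minimal sketch under entirely hypothetical assumptions: the input file, the column names, and the two emotion regressors are made up for illustration, and this is not the paper's actual specification.

```python
# Minimal sketch of a Jordà-style local projection: for each horizon h,
# regress the cumulative S&P 500 response on emotion measures taken from a
# testimony window. File and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("testimony_panel.csv")  # hypothetical minute-level panel

for h in range(1, 6):  # horizons: 1..5 intervals ahead
    # Cumulative log-price change from t to t+h
    y = df["sp500_logprice"].shift(-h) - df["sp500_logprice"]
    X = sm.add_constant(df[["negative_emotion", "pitch_deviation"]])
    fit = sm.OLS(y, X, missing="drop").fit(
        cov_type="HAC", cov_kwds={"maxlags": h}  # Newey-West standard errors
    )
    print(f"h={h}: beta={fit.params['negative_emotion']:+.4f} "
          f"(se={fit.bse['negative_emotion']:.4f})")
```

Running one regression per horizon, rather than a single dynamic model, is what lets the estimated response grow or fade in the days after a testimony, which matches the pattern Michelle described.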
But right now, I want to welcome Sawood Alam to the screen. Sawood works for the Internet Archive; he's part of the Wayback Machine team, doing amazing things, and he's going to walk you through some of the work he's doing on metadata behind the scenes with our web archive. So, Sawood, please, over to you.

[00:41:36] Allow me to share my screen. It may take a while. Okay, is my screen visible?

Looks great.

Awesome, thank you. So I'm Sawood Alam, a web data scientist for the Wayback Machine at the Internet Archive. What you see on my screen is the help text of a tool called CDX Summary. In my current working directory we have a sample .cdx.gz file, which is a compressed index of a bunch of web archive (WARC) files. But we don't know what was in those web archive files. Hopefully, when we run this index through our CDX Summary tool, we'll get to know what is in there. So I'm going to run this command here; it'll take a while to complete, so we'll come back to it later.

Let's go to the Internet Archive's home page. We see a bunch of collections listed. Some are collections of books and audio and video; some are web collections. Let's open one of these collections, in this case the Ukrainian culture and heritage collection. Scroll through this collection and you see some interesting, colorful thumbnails and meaningful titles, and you get a rough idea of what you're going to find in it. Now let's go to a web collection, which is a collection of WARC files in which an arbitrary number of web captures are packaged together. In this case we see a list of tombstones: the thumbnails and titles look very similar, as if they were machine-generated. It doesn't tell you a lot. There is an "About" tab in this collection view, but we don't see much there either: a few lines of metadata, and that's all. This is the problem we set out to solve.

In the next few days or weeks, when the testing is done and the changes are merged into the main site, this page will look something like this. We'll get to know how many web captures are in this collection; how many unique URIs were captured; how many of those URIs belong to, say, HTML pages with a 200 OK status; or how many of those URLs have no path and no query, that is, root pages versus deep links. We can also see the temporal spread of the collection. So this collection started in 2016 and then stopped, with no activity in 2017; then it started again in 2018, and since then it's been constantly archiving every single month.
Then we also have a bunch of top hosts that contributed to this collection with the largest numbers of captures. And one cool thing that the tool brings up is a bunch of sample URIs from this collection. A WARC file is a bundle of a number of URIs, and you don't know what is inside it, so this allows us to click on one of these links and see how and when those pages were archived, which helps us do QA, for example. All these numbers up here also allow us to gain insight and learn: sometimes, if the numbers are really off, it looks like there was some problem in our crawling process, and we go back and fix it.

The way this whole page is rendered is basically backed by a JSON file, and this JSON file is generated using a command-line tool called CDX Summary. We have open-sourced this tool, so you can use it to generate human-readable summaries or machine-readable JSON files, and if you have a JSON file you can render it in HTML using a web component that we have made available.

So with that, let's go back to the command line and see what happened to the local CDX file that we had. It turns out it generated a very nice human-readable summary of that CDX file. Now we get to know what is inside, when it was captured, how many captures are there, and so on and so forth. And we can even click on one of these links to open it in a browser and see an archived capture. By demonstrating it on the command line, what we really illustrated is that this tool is not tied to the Internet Archive's infrastructure. It is an independent tool; we are just using it at the Internet Archive, and anyone can use it on their own web archive collections.

So with that, I will recap. We created a command-line tool and a companion web component to summarize web collections. We released these tools under an open-source license. We use these tools to enrich our own web collections, and we gain insight from the generated statistics to improve our crawlers. Finally, we demonstrated that these tools are not exclusive to the Internet Archive, so anyone can use them on their own web archival collections and learn more about what they hold. With that, I will see you in the Q&A, and thank you so much for being with me.

[00:47:05] Thank you so much, Sawood. There are lots of comments and questions coming in in the chat, so I think we'll have a good question for you in just a minute.
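As a rough illustration of what a summary like this computes, here is a minimal sketch that tallies a compressed CDX index by hand. It assumes the common space-separated CDX-11 field layout and a hypothetical filename; for real work, the open-source CDX Summary tool does all of this and much more.

```python
# Minimal sketch of summarizing a compressed CDX index, assuming the common
# CDX-11 layout (urlkey, timestamp, original, mimetype, statuscode, ...).
# "collection.cdx.gz" is a hypothetical filename.
import gzip
from collections import Counter

captures = 0
unique_urls = set()
statuses = Counter()
years = Counter()

with gzip.open("collection.cdx.gz", "rt") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 5 or fields[0] == "CDX":  # skip header/malformed rows
            continue
        urlkey, timestamp, original, mimetype, status = fields[:5]
        captures += 1
        unique_urls.add(urlkey)
        statuses[status] += 1
        years[timestamp[:4]] += 1  # capture year, for the temporal spread

print(f"{captures} captures of {len(unique_urls)} unique URLs")
print("status codes:", dict(statuses))
print("captures per year:", dict(sorted(years.items())))
```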
401 00:47:25,000 --> 00:47:29,000 It's a little Willy Wonka, and just charming and wonderful. 402 00:47:29,000 --> 00:47:39,000 So Art's gonna show you in this video the scanning robot that he's built alongside one of the Internet Archive's 403 00:47:39,000 --> 00:47:41,000 scanning frames, the TT scanner. 404 00:47:41,000 --> 00:47:47,000 So, Caitlin, let's roll the video. Hi there! 405 00:47:47,000 --> 00:47:52,000 My name is Art Rhino. I work at the University of Windsor in Ontario, Canada, and I'm 406 00:47:52,000 --> 00:47:57,000 on the board of a nonprofit organization called Our Digital World, or ODW. 407 00:47:57,000 --> 00:48:02,000 My library acquired a very nice tabletop scanning machine from the Internet Archive 408 00:48:02,000 --> 00:48:09,000 years ago. This unit currently has good availability, and I've become interested in scanning the back file for a major paper collection. 409 00:48:09,000 --> 00:48:13,000 A lot of the material is well positioned for scanning. 410 00:48:13,000 --> 00:48:16,000 Some of it can be disassembled, and the pages can be fed through 411 00:48:16,000 --> 00:48:23,000 our Xerox scan feature. However, there's a lot of tight binding, and I've been trying to find a way to scan it without compromising the binding. 412 00:48:23,000 --> 00:48:33,000 Here are the components I've used to pull together an automated scanning setup, most of which were already on hand from ODW's work on DIY microfilm scanners. 413 00:48:33,000 --> 00:48:40,000 One new addition is an inflatable snow tube, which came from the end-of-season table at a local hardware store. 414 00:48:40,000 --> 00:48:45,000 The air mattress pump inflates the snow tube, and a vacuum cleaner holds on to the page. 415 00:48:45,000 --> 00:48:51,000 The device in the front that looks like a car is the mBot, from Makeblock. 416 00:48:51,000 --> 00:48:54,000 My mBot is several years old, and I use the mBlock 417 00:48:54,000 --> 00:49:02,000 IDE on the web for controlling it. There are wireless and Bluetooth options for newer models that would be more elegant for this kind of thing. 418 00:49:02,000 --> 00:49:07,000 However, this worked well enough, and it seems to do the trick for what we want to do. 419 00:49:07,000 --> 00:49:16,000 Probably the biggest drawback of all this is the noise from the air mattress pump and the vacuum cleaner, which you can't hear here, but otherwise I'm pretty happy with this approach. 420 00:49:16,000 --> 00:49:21,000 It takes about 30 to 45 minutes to scan the typically 100 pages in a title. 421 00:49:21,000 --> 00:49:29,000 You can find way more sophisticated and faster devices on YouTube, and it would be cool to make something more general purpose and robust. 422 00:49:29,000 --> 00:49:37,000 But this is one approach for automatically scanning material on the Table Top Scribe that is hopefully useful for others. 423 00:49:37,000 --> 00:49:50,000 Thanks for listening. So I have it on good authority that Art does a lot of really interesting projects and is, as you can see, a little bit of a tinkerer. 424 00:49:50,000 --> 00:49:55,000 And just what creativity! I think that someone mentioned it looks like a rotisserie chicken 425 00:49:55,000 --> 00:50:00,000 turner next to a microfilm scanner. Anyway, 426 00:50:00,000 --> 00:50:11,000 interesting stuff there. I would like to bring Michelle and Sawood back to the screen for a couple of questions.
427 00:50:11,000 --> 00:50:15,000 And so, for the first question, if you wanna turn your cameras on. 428 00:50:15,000 --> 00:50:18,000 Thanks, Michelle. The question that I have is for you: 429 00:50:18,000 --> 00:50:28,000 are there data points, or related data, that you'd like to include in your research that simply weren't or aren't available in digital or computable form? 430 00:50:28,000 --> 00:50:33,000 That's a great question. There are holes in the collection. 431 00:50:33,000 --> 00:50:47,000 Unfortunately, the Internet Archive hasn't gone back quite far enough for what we'd love to do, which is to go back sort of to the beginning to include a lot more of the data from, say, Greenspan's time period, or even 432 00:50:47,000 --> 00:50:57,000 people before that, and sometimes some of the audio files or some other forms of communications that could have happened around the time. 433 00:50:57,000 --> 00:51:09,000 So some of that's not available; some we're still trying to dig through other archives to see whether or not they've been put someplace else or somehow misplaced. But there 434 00:51:09,000 --> 00:51:15,000 is the unfortunate thing, and it's why the Internet Archive is so important, that things have gotten destroyed over time. 435 00:51:15,000 --> 00:51:18,000 So we would really like to have it as inclusive as possible. 436 00:51:18,000 --> 00:51:27,000 We'd also like to look at some of the issues between Janet Yellen and other female Fed authorities, and whether or not there are different responses to them. 437 00:51:27,000 --> 00:51:33,000 And just sometimes the video feeds and everything like that aren't there right now. 438 00:51:33,000 --> 00:51:42,000 So those are the primary ones. We're getting much better at getting stock data and everything coming in, so that's not quite so much of a problem. Fascinating. 439 00:51:42,000 --> 00:51:45,000 Thanks so much for that. Sawood, a question for you. 440 00:51:45,000 --> 00:51:49,000 The comment is: thank you for making your CDX tool open source. 441 00:51:49,000 --> 00:51:55,000 What other projects would you envision using your tool? Oh, interesting! 442 00:51:55,000 --> 00:52:07,000 So, one immediate application I was thinking of the other day: there is an emerging format called WACZ, 443 00:52:07,000 --> 00:52:12,000 which is like bundling a bunch of WARC files with their indices and all that stuff. 444 00:52:12,000 --> 00:52:21,000 I think this tool can integrate well in there, so if someone kind of, you know, loads the WACZ file in a browser, they can see what really is inside. 445 00:52:21,000 --> 00:52:27,000 And it has been a challenge. I mean, Mark Graham, the director of the Wayback Machine, 446 00:52:27,000 --> 00:52:34,000 often asks me: okay, so here is this item we collected; what is inside in there? And, you know, with this tool, 447 00:52:34,000 --> 00:52:40,000 you run it against that, and you will have a bunch of view models that you can play with and explore from there. Again, 448 00:52:40,000 --> 00:52:47,000 I mean, this gives both statistical insight as well as a kind of new exploratory tool. 449 00:52:47,000 --> 00:52:53,000 Another option I've been looking forward to is to kind of, you know, build on 450 00:52:53,000 --> 00:52:58,000 these random URLs that we pull out as samples; those are not just random.
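[Editor's note: Sawood expands on the sampling criteria just below. As an illustration only, this sketch pulls a fixed-size random sample of captures from a CDX index using reservoir sampling, the kind of sample that can be spot-checked or turned into QA thumbnails; the filter (HTML with a 200 status) is an assumed criterion, not necessarily his.]

```python
# Illustrative only: draw k random captured URLs from a CDX index for QA.
import gzip
import random

def sample_urls(path, k=10, seed=42):
    rng = random.Random(seed)
    sample = []  # reservoir of (timestamp, url) pairs
    seen = 0
    with gzip.open(path, "rt", errors="replace") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 5:
                continue
            _, timestamp, url, mime, status = fields[:5]
            # Keep only successfully captured HTML pages (assumed criterion).
            if mime != "text/html" or status != "200":
                continue
            seen += 1
            # Reservoir sampling: every qualifying capture ends up in the
            # sample with equal probability k/seen.
            if len(sample) < k:
                sample.append((timestamp, url))
            elif rng.random() < k / seen:
                sample[rng.randrange(k)] = (timestamp, url)
    return sample

for ts, url in sample_urls("samples.cdx.gz"):          # hypothetical file
    print(f"https://web.archive.org/web/{ts}/{url}")   # Wayback replay URL
```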
451 00:52:58,000 --> 00:53:12,000 They have some additional criteria to them, and we can use those to generate thumbnails and have, like, a slideshow or something like that, to have a more visual, you know, understanding of what's been archived in a 452 00:53:12,000 --> 00:53:15,000 collection, and use it as a way to kind of 453 00:53:15,000 --> 00:53:21,000 tell a story about a collection. Thank you for that, 454 00:53:21,000 --> 00:53:28,000 for that explanation. I understand from a message from Michael Nelson in chat that you're gonna be teaching a course next semester? 455 00:53:28,000 --> 00:53:38,000 Oh, yes, it will be web server design. So yeah, my students will be competing with, say, Nginx and Apache and some other fancy web servers. 456 00:53:38,000 --> 00:53:46,000 I hope to have a more, you know, compliant web server, with regard to the RFCs and standards, 457 00:53:46,000 --> 00:53:52,000 than some of these other well-known ones. What's that, what's that 458 00:53:52,000 --> 00:53:57,000 saying? Well, good luck to you in the course, and 459 00:53:57,000 --> 00:54:00,000 your students are in excellent hands, for sure. Michelle, 460 00:54:00,000 --> 00:54:07,000 Sawood, thank you so much for your presentations today, and we'd like to keep moving through today's talks. 461 00:54:07,000 --> 00:54:12,000 If you have additional questions for Michelle or Sawood, please do drop them into the Q 462 00:54:12,000 --> 00:54:23,000 &A. But what I'd like to do now is welcome Spencer Torene to the screen, from Thomson Reuters Special Services, and Spencer's gonna talk about the work that he's done 463 00:54:23,000 --> 00:54:27,000 with automatic hashtag hierarchy generation, and I can't wait to hear that. 464 00:54:27,000 --> 00:54:32,000 So, Spencer, over to you. 465 00:54:32,000 --> 00:54:38,000 Yep. 466 00:54:38,000 --> 00:54:47,000 Okay, hello, welcome everyone. Yeah. So I'd like to talk about, well, what Chris just said. 467 00:54:47,000 --> 00:54:54,000 So the idea is, you know, we have all these tags, hashtags, on Twitter, and the question is: do they have a hierarchy? 468 00:54:54,000 --> 00:55:02,000 Is there structure to them? And I'll talk a little bit about the difference between ontologies and folksonomies really quickly. 469 00:55:02,000 --> 00:55:08,000 So we're all probably familiar with ontologies, if not by name, at least by concept. 470 00:55:08,000 --> 00:55:14,000 So we have things like the dictionary. Shall we say, in the dictionary there are very curated terms, 471 00:55:14,000 --> 00:55:20,000 there are rigorous definitions of these terms, and there are, you know, well-defined relations between these terms. 472 00:55:20,000 --> 00:55:26,000 So an example is: a foot has five toes and is part of the human body. There are many more definitions of what a foot is, 473 00:55:26,000 --> 00:55:30,000 as you can see on the right, and I think on the right is just some of them. 474 00:55:30,000 --> 00:55:36,000 But we have to compare that with the difficulty of looking at, let's say, Twitter in this respect. 475 00:55:36,000 --> 00:55:44,000 So Twitter has been called a folksonomy: these hashtags that are used are very arbitrary. 476 00:55:44,000 --> 00:55:50,000 Their definitions are circumstantial; there are very undefined relationships among them. 477 00:55:50,000 --> 00:55:58,000 I just created two hashtags down there, and if I used them on Twitter, they would now exist in the record.
478 00:55:58,000 --> 00:56:03,000 What do some of those hashtags even mean? I don't even know, 479 00:56:03,000 --> 00:56:07,000 really. I just put up some nonsense just to illustrate the fact that I could do it. 480 00:56:07,000 --> 00:56:18,000 So yeah, if we want to put some structure to these hashtags, we have to be able to ask at least the minimal question, which is, you know: are there hashtags which are more general than others? Shall we 481 00:56:18,000 --> 00:56:22,000 say, is hashtag dog more general than hashtag dalmatian? 482 00:56:22,000 --> 00:56:27,000 If dog, in our ontological thinking, is more general than dalmatian, is that true 483 00:56:27,000 --> 00:56:31,000 on Twitter? Is hashtag AI more general than hashtag machine learning? 484 00:56:31,000 --> 00:56:45,000 What does "general" mean on Twitter? An ontology would say that a word is general perhaps due to the breadth of applicability of the term, and that, again, would be by the definition. 485 00:56:45,000 --> 00:56:59,000 There's nothing to say that you could or should use a term very often, but by the definition of the term, you could say that the term is general. With folksonomies, since there's no real clear definition for these 486 00:56:59,000 --> 00:57:06,000 things, we probably have to take the question in a different direction and talk about the breadth of application rather than the breadth of applicability. 487 00:57:06,000 --> 00:57:10,000 So we're gonna ask certain contextual questions, like, you know: when was it used? 488 00:57:10,000 --> 00:57:17,000 Where was it used? Who used it? How was it used? So we look at contexts as proxies for generality. 489 00:57:17,000 --> 00:57:21,000 So there's a classic one, which would be the popularity context: 490 00:57:21,000 --> 00:57:27,000 if a lot of people are using it, that might indicate it's for general use. There are lots of temporal contexts, 491 00:57:27,000 --> 00:57:36,000 so here are some examples. Hashtag disease would be, you know, not related to a specific event, although Covid-19 is; it's a big deal, 492 00:57:36,000 --> 00:57:44,000 it's a big event, but it's nonetheless an event, and in some years I probably will see that hashtag die out. 493 00:57:44,000 --> 00:57:48,000 You can have hashtag holiday, which could happen, you know, any time of the year, while 494 00:57:48,000 --> 00:57:53,000 Halloween is generally one time a year. You could have hashtag work, which occurs throughout the week, while 495 00:57:53,000 --> 00:58:03,000 TGIF might be a Friday thing. And you could certainly have hashtag food any time of the day, and breakfast maybe more so in the morning. 496 00:58:03,000 --> 00:58:12,000 And then the all-important semantic contexts. So, you know, does a hashtag co-occur with many other different hashtags? 497 00:58:12,000 --> 00:58:15,000 This is sort of the state of the art at the moment, you know: 498 00:58:15,000 --> 00:58:19,000 how many other different, unique hashtags is a hashtag used with? 499 00:58:19,000 --> 00:58:25,000 If very many, then you could call it general, and if not very many, then maybe not so general. 500 00:58:25,000 --> 00:58:33,000 And then we can also look at topics, which you might consider to be groups of hashtags: is it used within a group or outside the group? And also tokens and words: 501 00:58:33,000 --> 00:58:47,000 so, is it used with many different ideas or not?
The way we measure all this is basically with what is called the Shannon diversity index, which is an 502 00:58:47,000 --> 00:58:54,000 ecologically inspired term; it is just Shannon entropy. For those of you who are not familiar with Shannon entropy, 503 00:58:54,000 --> 00:59:00,000 I will briefly describe it. So on the left you see something with very low entropy, very low diversity: 504 00:59:00,000 --> 00:59:10,000 you kind of know what you're getting with that thing, where it fits, what context it's used in. And then over on the very far right, you would have no idea when and where you might see this thing, 505 00:59:10,000 --> 00:59:14,000 so you might consider it to be very diverse. 506 00:59:14,000 --> 00:59:19,000 I mean, let me just say: I mentioned the eight different contexts a few slides previously. 507 00:59:19,000 --> 00:59:31,000 That's what you use: you have eight of these diversity measures for each hashtag, and then you just multiply them by some weighting, and you come up with this ensemble diversity index, and that value would be an 508 00:59:31,000 --> 00:59:39,000 indication of how general that hashtag is: the higher the value, the more general it is; the lower the value, the less general it is, the more specific. 509 00:59:39,000 --> 00:59:42,000 So I'll get to the data, the all-important data. 510 00:59:42,000 --> 00:59:47,000 So, you know, we went to the archive for it: we used Twitter's 1% Spritzer stream, 511 00:59:47,000 --> 00:59:55,000 and we got 52 months of data between October 2016 and December 2020: 146,000,000 English language tweets 512 00:59:55,000 --> 01:00:00,000 (we have many more than that in other languages) and 360,000 hashtags. 513 01:00:00,000 --> 01:00:06,000 So we took all that data and we boiled it down into a hashtag network of the co-occurring hashtags, 514 01:00:06,000 --> 01:00:11,000 and then we used this network to calculate a lot of our measures. 515 01:00:11,000 --> 01:00:14,000 So here are some of the hierarchies. So, you know, 516 01:00:14,000 --> 01:00:22,000 here's the data community, or shall we say the top 10 most diverse hashtags within the data community. 517 01:00:22,000 --> 01:00:28,000 There are over 2,000 other hashtags in this community alone. We can see at the top 518 01:00:28,000 --> 01:00:37,000 you have pretty familiar words there that you may recognize. And then we can look across other communities of hashtags: 519 01:00:37,000 --> 01:00:49,000 here's the data community, the beer community, the coffee community, and then the dog community. At the top, you know, the most diverse, the most general hashtags are very familiar terms; you would recognize most all 520 01:00:49,000 --> 01:00:53,000 of them. And then at the bottom are the least diverse hashtags, the most specific; you know, 521 01:00:53,000 --> 01:00:58,000 I've never seen these before. They look like they sort of make sense in the community, 522 01:00:58,000 --> 01:01:02,000 but they're very narrowly used. And 523 01:01:02,000 --> 01:01:07,000 there's a beautiful tale here, which is that, based on our findings, 524 01:01:07,000 --> 01:01:16,000 hashtag love happens to be the most diverse hashtag, according to our method, according to the archive.org data that we have. 525 01:01:16,000 --> 01:01:21,000 So that's a beautiful story there.
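[Editor's note: a minimal sketch of the measure just described: Shannon entropy per usage context, combined into a weighted ensemble diversity index. The two contexts, their counts, and the equal weights are made-up illustrations, not the eight contexts or weightings from the actual work.]

```python
# Shannon diversity per context, combined into an ensemble diversity index.
import math
from collections import Counter

def shannon_diversity(counts):
    """Shannon entropy H = -sum(p * ln p) over a context's usage counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c)

def ensemble_diversity(context_counts, weights):
    """Weighted sum of per-context diversities; higher means more general."""
    return sum(weights[name] * shannon_diversity(counts)
               for name, counts in context_counts.items())

# Toy usage data for a single hashtag: the hours of day it was tweeted, and
# the other hashtags it co-occurred with (both invented for illustration).
contexts = {
    "hour_of_day": Counter({9: 40, 12: 35, 18: 38, 23: 30}),
    "cooccurrence": Counter({"puppy": 20, "pets": 15, "cute": 12, "walk": 9}),
}
weights = {"hour_of_day": 0.5, "cooccurrence": 0.5}
print(round(ensemble_diversity(contexts, weights), 3))
```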
526 01:01:21,000 --> 01:01:34,000 And you can see some of the communities related to the love community; these are the most tightly corresponding communities. Not shown are thousands of other communities and hundreds of thousands of other hashtags, 527 01:01:34,000 --> 01:01:38,000 but you can certainly, you know, dive into it. 528 01:01:38,000 --> 01:01:41,000 So what can this be used for? It can be used as a hashtag recommender: 529 01:01:41,000 --> 01:01:44,000 if hashtag dalmatian, maybe also hashtag dog. 530 01:01:44,000 --> 01:01:51,000 But the idea of this general-to-specific ordering suggests that you don't want to do: if hashtag dog, 531 01:01:51,000 --> 01:01:53,000 maybe also hashtag dalmatian. That doesn't really work. 532 01:01:53,000 --> 01:01:57,000 You can look at hierarchical hashtag topic models: 533 01:01:57,000 --> 01:02:00,000 so we know that hashtag AI is part of the data community, 534 01:02:00,000 --> 01:02:04,000 so that's, you know, good information, and we also know where it sits within the data community. 535 01:02:04,000 --> 01:02:08,000 We can do social science investigations: what does hashtag Minnesota mean, 536 01:02:08,000 --> 01:02:20,000 and when did it mean that? It certainly was more geographical prior to May of 2020 and the murder of George Floyd in Minneapolis, when hashtag Minnesota became associated with the Black Lives 537 01:02:20,000 --> 01:02:23,000 Matter movement. And, you know, you can use other platforms too: 538 01:02:23,000 --> 01:02:29,000 we showed Twitter, but, you know, any tags will do for this. 539 01:02:29,000 --> 01:02:33,000 So, thank you for your time. I appreciate it, and thank you to the presenters again. 540 01:02:33,000 --> 01:02:40,000 And if you're interested, these are the names of the relevant papers. Thank you so much, Spencer. 541 01:02:40,000 --> 01:02:48,000 We have some questions that are coming in, and if you in the audience have additional questions, please do drop them into the Q&A. 542 01:02:48,000 --> 01:02:54,000 The one thing that I was just struck by is that data, beer, coffee, and dog hashtags are life. 543 01:02:54,000 --> 01:02:58,000 So hold tight there for a second before we go on. 544 01:02:58,000 --> 01:03:11,000 I wanna acknowledge, and I'm remiss in not acknowledging, Art Rhino, who is in the audience today. That fantastic video that we just watched: Art is here watching along with us. So thank 545 01:03:11,000 --> 01:03:21,000 you, Art, for that submission and for your creative work. Up next in our show is another video, and another presenter who is also in the audience. 546 01:03:21,000 --> 01:03:29,000 So we're gonna hear from Jim Salmons, who is going to tell us about his Internet Archive-enabled journey as a digital humanities 547 01:03:29,000 --> 01:03:43,000 citizen scientist. So, Caitlin, let's roll. Hi, I'm Jim Salmons, and this lightning talk is a whirlwind trip through my post-cancer journey of rebirth as a digital humanities citizen scientist, 548 01:03:43,000 --> 01:03:48,000 and how the inspiration of and access to the Internet Archive made that possible. 549 01:03:48,000 --> 01:04:00,000 Starting in 2012, through 2014, both my wife, Timlynn Babitsky, and I had terrifying cancer battles that we fortunately survived. To pay it forward, in celebration of our 25th wedding 550 01:04:00,000 --> 01:04:08,000 anniversary, we funded the digitization of the 48 issues of the Apple computer-focused Softalk magazine into the Internet 551 01:04:08,000 --> 01:04:14,000 Archive. Softalk has a special place in my heart, as I was a reader,
552 01:04:14,000 --> 01:04:28,000 advertiser, writer, and, during the time of its explosive growth, an executive at Softalk Publishing, where I designed and helped develop the software that ran the back office production and advertising processes. As part of our 553 01:04:28,000 --> 01:04:40,000 initial Softalk preservation project, Timlynn and I went to the midwest scanning center of the Archive, where we saw firsthand and participated in the amazing behind-the-scenes scanning service that 554 01:04:40,000 --> 01:04:49,000 puts digital collections into the Internet Archive. Not wanting our contribution to end with simply getting the digital edition of Softalk into the archive, 555 01:04:49,000 --> 01:04:55,000 we followed up with activity that was the true beginning of my rebirth as a digital humanities 556 01:04:55,000 --> 01:05:09,000 citizen scientist. As I learned about the challenges of text and data mining of digital collections within the cultural heritage domain, I decided to create a ground-truth storage format that would support an integrated model of a 557 01:05:09,000 --> 01:05:26,000 magazine's complex document structures and content depiction. Between 2015 and 2019, I developed a personal learning network of largely EU- and UK-based mentors and collaborators to support the development of the MagazineGTS 558 01:05:26,000 --> 01:05:37,000 metadata format, based on international museum ontology standards, with posters and papers accepted at EU-based digitization conferences and workshops. 559 01:05:37,000 --> 01:05:45,000 My work initially focused on the structure and content of advertisements in Softalk. 560 01:05:45,000 --> 01:05:54,000 Everything came to a crashing halt when I suffered a devastating spinal cord injury in 2020. During the year and a half of my rehab and recovery, 561 01:05:54,000 --> 01:06:07,000 the digital humanities domain saw explosive growth in the use of machine learning technologies. As I reinvigorate my research, I have expanded my focus from the Softalk collection to 562 01:06:07,000 --> 01:06:20,000 consider the challenges of investigating the massive Internet Archive collection of computer magazines, consisting of tens of thousands of issues of publications, in dozens of languages, published all around the world, 563 01:06:20,000 --> 01:06:26,000 looking at the impact of computers and the emergence of the digital world we live in today. 564 01:06:26,000 --> 01:06:33,000 My initial exploration of this expanded collection is focused on development of a ground-truth 565 01:06:33,000 --> 01:06:37,000 dataset of computer magazine TOC, or table of contents, 566 01:06:37,000 --> 01:06:45,000 pages. TOCs serve as a Sudoku-puzzle-like set of hints about the document structures of a magazine. 567 01:06:45,000 --> 01:06:56,000 This dataset will be invaluable for training machine learning models to help move digitization pipelines from within-page to whole-document layout recognition. 568 01:06:56,000 --> 01:07:04,000 My goal now is to stimulate collaboration between my EU and Ukraine friends and new partners from Stanford Libraries 569 01:07:04,000 --> 01:07:17,000 and its AI lab, the Computer History Museum, and the Internet Archive, to forge a research consortium to further preserve and make accessible, for scholarly research and public interest, 570 01:07:17,000 --> 01:07:23,000 the computer magazines meta-collection at the Internet Archive. On behalf of myself and Timlynn Babitsky,
571 01:07:23,000 --> 01:07:40,000 thank you to the archive and webinar organizers for inviting me to present this lightning talk. And thanks to you, Jim, for sharing your inspiring story, your video, your work, and for sponsoring the digitization of the 572 01:07:40,000 --> 01:07:44,000 materials that are now available to all at the Internet Archive. 573 01:07:44,000 --> 01:07:52,000 Your table of contents, or TOC, work is of high interest to us. 574 01:07:52,000 --> 01:08:01,000 I know also, from my previous work with the Biodiversity Heritage Library, that those tables of contents are really where it's all at in terms of the structural metadata for 575 01:08:01,000 --> 01:08:12,000 the articles. So we'll be following up to learn a little bit more about what you're doing there. Up next I'd like to welcome Emmanuel Tranos to the screen, 576 01:08:12,000 --> 01:08:17,000 who's gonna tell us about the relationship that exists between the web and cities. 577 01:08:17,000 --> 01:08:22,000 This is a really interesting talk. I know you're gonna love it. So over to you, 578 01:08:22,000 --> 01:08:27,000 Emmanuel. Chris, thank you so much for this. 579 01:08:27,000 --> 01:08:35,000 Let me try to share my screen, and also tell you how excited I am to be part of this very cool event. 580 01:08:35,000 --> 01:08:44,000 So my name is Emmanuel Tranos. I'm a reader in quantitative human geography at the University of Bristol and the Alan Turing 581 01:08:44,000 --> 01:08:52,000 Institute in the UK, and I guess I'm of this rare breed of geographers who have a fascination with the Internet. 582 01:08:52,000 --> 01:09:07,000 So today I'm gonna give you a brief overview of our research using data from the Internet Archive to understand the link between the web and cities. We're using such data to understand the early Internet, 583 01:09:07,000 --> 01:09:13,000 but also other interesting economic questions. So, what data do we use? 584 01:09:13,000 --> 01:09:21,000 We're using data that was curated by the British Library here in the UK. 585 01:09:21,000 --> 01:09:26,000 This is data called the JISC UK Web Domain Dataset, 586 01:09:26,000 --> 01:09:40,000 and this simply contains all the archived web pages from the Internet Archive under the .uk top-level domain during the 1996 to 2012 period. 587 01:09:40,000 --> 01:10:01,000 The British Library did something very clever here: they scanned the web text of all these archived web pages and created a further subset, which only includes those archived web pages which contain a UK postcode 588 01:10:01,000 --> 01:10:07,000 within their web text, and I believe you can see the link to these datasets at the top. 589 01:10:07,000 --> 01:10:14,000 So we started our research with almost half a billion lines, which look like this: 590 01:10:14,000 --> 01:10:29,000 we know that this archived URL contains within its web text this UK postcode, and this postcode refers to a very small area, almost a block, 591 01:10:29,000 --> 01:10:33,000 and we also know the timestamp of this archived capture: 592 01:10:33,000 --> 01:10:38,000 the 9th of September 2008. 593 01:10:38,000 --> 01:10:51,000 So what do we do with these data? First, we used these data to create a measure of the online content of local interest, 594 01:10:51,000 --> 01:11:08,000 and we matched these data with a large individual survey.
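[Editor's note: a minimal sketch, assuming the record layout Emmanuel describes (archived URL, UK postcode found in its web text, capture timestamp), of aggregating a simple "online content of local interest" measure: archived pages per postcode district per year. The tab-separated field order and the file name are assumptions, not the project's actual pipeline.]

```python
# Aggregate archived pages per UK postcode district per capture year.
import csv
from collections import Counter

def local_content_by_year(path):
    counts = Counter()  # (postcode district, year) -> archived page count
    with open(path, newline="") as f:
        for url, postcode, timestamp in csv.reader(f, delimiter="\t"):
            district = postcode.split()[0]  # e.g. "BS8 1TH" -> "BS8"
            year = timestamp[:4]            # e.g. "20080909..." -> "2008"
            counts[(district, year)] += 1
    return counts

# Hypothetical usage: counts[("BS8", "2008")] would be the number of archived
# .uk pages mentioning a BS8 postcode that were captured during 2008.
counts = local_content_by_year("uk_postcode_pages.tsv")  # hypothetical file
```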
And we were able to illustrate that the availability of online content of local interest actually attracts individuals online. 595 01:11:08,000 --> 01:11:15,000 We knew a lot about the factors that push individuals, you know, to connect to the Internet, to spend more time online, 596 01:11:15,000 --> 01:11:25,000 but this was the first time we were able to say something about the contextual factors that are attracting individuals to spend more time online, 597 01:11:25,000 --> 01:11:33,000 and, importantly, to do this at the very local scale, at a different scale than 598 01:11:33,000 --> 01:11:40,000 we were used to. Then we used such data in order to understand economic clusters. 599 01:11:40,000 --> 01:11:46,000 And when I say economic clusters, I refer to these neighborhoods within cities 600 01:11:46,000 --> 01:11:54,000 that host, you know, very specific economic activities, these very specialized neighborhoods. 601 01:11:54,000 --> 01:12:15,000 We focused on Shoreditch, which is a very well-known tech cluster here in London, and we used archived web data in order to map the evolution of this economic cluster over space and time, but, importantly, to map the 602 01:12:15,000 --> 01:12:21,000 evolution of this cluster also in terms of the types of economic activities that took place, 603 01:12:21,000 --> 01:12:35,000 you know, within this cluster. And we were able, you know, to extract meaningful types of economic activities, much more detailed than official data would have enabled us to do. 604 01:12:35,000 --> 01:12:47,000 Then we changed the scale of the analysis, and we moved, you know, from a small neighborhood in London to the whole of the UK. We utilized, 605 01:12:47,000 --> 01:12:56,000 we employed such archived web data in order to test the economic effects 606 01:12:56,000 --> 01:13:03,000 that the early adoption of web technologies can generate for regions here in the UK. 607 01:13:03,000 --> 01:13:15,000 So we were able to build, you know, measures of the volume of online commercial, you know, content back from 2000, 608 01:13:15,000 --> 01:13:24,000 and we were able to link, you know, these measures from 2000 to specific regions within the UK. 609 01:13:24,000 --> 01:13:33,000 We then utilized econometric techniques and were able to illustrate an interesting result: the volume of online content 610 01:13:33,000 --> 01:13:45,000 back from 2000, which to us represents the early adoption of web technologies, is actually associated with positive economic productivity effects. 611 01:13:45,000 --> 01:13:56,000 And these productivity effects are long-lasting. They are long-term, you know, positive effects that regions 612 01:13:56,000 --> 01:14:16,000 that employed these technologies early are able to enjoy for longer, for quite lengthy time periods. And, last but not least, at a similar scale, we used such archived web data, and more specifically 613 01:14:16,000 --> 01:14:27,000 HTML links between commercial websites, to predict trade between different UK regions. 614 01:14:27,000 --> 01:14:38,000 Methodologically, we were able to make out-of-sample predictions, using machine learning algorithms, regarding these regional trade flows. 615 01:14:38,000 --> 01:14:46,000 This is important because there is hardly any data for trade between regions at such a small scale.
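[Editor's note: a toy sketch of the link-based idea just described: aggregate hyperlink counts between regions as a feature and train a model on region pairs with known trade, then predict flows for pairs with no data. The single feature, the random forest model, and every number here are illustrative assumptions, not the published methodology.]

```python
# Predict inter-regional trade from website-to-website hyperlink counts.
from collections import Counter
from sklearn.ensemble import RandomForestRegressor

# (origin, destination) -> hyperlinks between commercial websites that were
# geolocated to those regions via postcodes. All numbers are invented.
links = Counter({("London", "Bristol"): 950, ("London", "Leeds"): 700,
                 ("Bristol", "Leeds"): 120, ("Leeds", "Bristol"): 90})

# Known trade flows for some region pairs (toy values).
trade = {("London", "Bristol"): 410.0, ("London", "Leeds"): 330.0,
         ("Bristol", "Leeds"): 55.0}

train_pairs = list(trade)
X_train = [[links[p]] for p in train_pairs]
y_train = [trade[p] for p in train_pairs]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Out-of-sample prediction for a pair with no official trade figure.
print(model.predict([[links[("Leeds", "Bristol")]]]))
```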
616 01:14:46,000 --> 01:15:00,000 So, by using this freely available, you know, data, data that the Internet Archive collects for all of us, we were able to make predictions for such an important policy element, and local authorities can utilize 617 01:15:00,000 --> 01:15:18,000 these data that we generated. So, all in all, using this freely available data, we were able to map the evolution and the geography of the engagement with the Internet, especially at these early stages, and trust me, there is hardly 618 01:15:18,000 --> 01:15:26,000 any other data, you know, that can go so far back in time and also be so granular. By doing that, we were able to draw important lessons 619 01:15:26,000 --> 01:15:43,000 regarding the deployment of other future technologies, and we were also able to understand economic activities at a very granular level, both in terms of space and time, but also in terms of the context in which they take place, 620 01:15:43,000 --> 01:15:56,000 within and between cities. All of this research can be found in various papers we have published, which you can find on my website. 621 01:15:56,000 --> 01:16:04,000 Again, thank you so much for having me here today. Thank you, Emmanuel, for sharing your research. 622 01:16:04,000 --> 01:16:07,000 We do have a question for you and for Spencer, 623 01:16:07,000 --> 01:16:12,000 but we wanna wrap up today with a final video for our session. 624 01:16:12,000 --> 01:16:17,000 So we're gonna welcome Tom Gally back to the screen, virtually, 625 01:16:17,000 --> 01:16:29,000 and he's gonna share his research and his work on the forgotten novels of the nineteenth century. 626 01:16:29,000 --> 01:16:34,000 I started reading old books back in the 1960s, when I was still a child. 627 01:16:34,000 --> 01:16:42,000 So how did I decide what books to read? Well, we had a lot of books at home, and I read some of those. 628 01:16:42,000 --> 01:16:53,000 There was a small public library nearby, and I liked to browse the shelves there, too. And when I got my allowance I would go to the local bookstore, 629 01:16:53,000 --> 01:16:59,000 look through those shelves, and maybe buy a couple of paperbacks that caught my eye. 630 01:16:59,000 --> 01:17:05,000 The books I happened to see on those shelves would shape my reading choices in the years to come. 631 01:17:05,000 --> 01:17:14,000 When I started reading nineteenth century novels, I naturally gravitated towards authors and books that I had seen on those shelves: 632 01:17:14,000 --> 01:17:18,000 Charles Dickens, Jane Austen, Nathaniel Hawthorne, 633 01:17:18,000 --> 01:17:23,000 Crime and Punishment, The Adventures of Huckleberry Finn. 634 01:17:23,000 --> 01:17:37,000 In other words, the classics. It was only later, after I had access to a large university library, that I discovered the vast number of other novels published in the nineteenth century. 635 01:17:37,000 --> 01:17:43,000 Their pages were yellow and brittle, and most had never been reprinted. 636 01:17:43,000 --> 01:17:48,000 But after I graduated from college I could no longer access those books. 637 01:17:48,000 --> 01:18:07,000 I was stuck with the classics again. So fast-forward to around the year 2010: libraries around the world were now scanning their old books, and the Internet Archive was making those books available online for free. Now anyone in the 638 01:18:07,000 --> 01:18:17,000 world could read all of those old novels, including the thousands that publishers had not reprinted and marketed as classics.
639 01:18:17,000 --> 01:18:24,000 So in 2021, just for fun, I compiled a list of nineteenth century novels at the Internet 640 01:18:24,000 --> 01:18:30,000 Archive. I chose only novels that nobody seemed to be reading anymore, 641 01:18:30,000 --> 01:18:36,000 the once-popular fiction that had been overlooked by the classics industry. 642 01:18:36,000 --> 01:18:44,000 You can find that list on the Internet Archive's blog under the title Forgotten Novels of the Nineteenth Century. 643 01:18:44,000 --> 01:18:53,000 I've enjoyed dipping into those books; maybe you will too. 644 01:18:53,000 --> 01:19:06,000 So we'll share that link to Tom's blog post with everyone in the email follow-up, and I'm sure you'll find some new things that you haven't read for a while. Thanks, Duncan, for 645 01:19:06,000 --> 01:19:15,000 sharing that out. I would like to bring Spencer and Emmanuel back to the screen for a couple of questions, and the first one is for Spencer: 646 01:19:15,000 --> 01:19:22,000 have there been changes in how Twitter users use hashtags over time, especially more recently? 647 01:19:22,000 --> 01:19:28,000 That's a really good question. We haven't looked into it in that respect yet, but that is on the docket. 648 01:19:28,000 --> 01:19:30,000 So, you know, again I'll reference the hashtag Minnesota 649 01:19:30,000 --> 01:19:37,000 example, where, prior to May of 2020, it was all geography; I mean, it had to do with states, capitals, 650 01:19:37,000 --> 01:19:43,000 right? No mention of civil rights or anything. And then, after May of 2020, very much non-geographical; 651 01:19:43,000 --> 01:19:48,000 well, incidentally geographical, having to do with Black Lives Matter and the murder of George Floyd. 652 01:19:48,000 --> 01:19:53,000 So that's an example. But it's a great question and something we intend to dive into. 653 01:19:53,000 --> 01:19:59,000 Thank you for that. And a follow-up question for Emmanuel: 654 01:19:59,000 --> 01:20:03,000 is your research methodology extensible to other geographic areas, 655 01:20:03,000 --> 01:20:07,000 or would you need to change your approach for studying areas outside the UK? 656 01:20:07,000 --> 01:20:13,000 No, it is absolutely extendable and replicable 657 01:20:13,000 --> 01:20:28,000 outside of the UK. The main difference is having, you know, available datasets or subsets, you know, from the Internet Archive. And actually this is one of the values of these data, because these are openly available data 658 01:20:28,000 --> 01:20:39,000 sources, and by using these data, one can actually outperform some of the official data sources. That's fascinating. 659 01:20:39,000 --> 01:20:50,000 I know that there are some additional questions, but I'm also sensitive about our time. So what I'm going to ask everyone to do: everyone has shared their contact 660 01:20:50,000 --> 01:21:00,000 information, so if you have additional questions for any of our presenters today, maybe we can follow up offline or after the session. 661 01:21:00,000 --> 01:21:08,000 So thank you, Spencer and Emmanuel, for your conversations today, and for taking some time to field some questions. 662 01:21:08,000 --> 01:21:17,000 Thank you both. And also, thanks to Jim and to Tom for the videos that they offered up; Jim and Timlynn as well.
663 01:21:17,000 --> 01:21:22,000 So here we are at the end. We've made it! 664 01:21:22,000 --> 01:21:25,000 I wanna give an acknowledgement to 665 01:21:25,000 --> 01:21:38,000 some people who helped guide this series from its start, from inception. We pulled together an advisory group for our series, and that included some real heavy hitters in the digital humanities space. 666 01:21:38,000 --> 01:21:52,000 So a big thank you to Dan Cohen from Northeastern, to Makiba Foster from Broward County Library, Mike Furlough from HathiTrust, and Harriet Green at Washington University in St. 667 01:21:52,000 --> 01:22:06,000 Louis. They really helped shape the talks that we saw here today and helped give guidance on how we should frame a conversation, a long-tail conversation, a longitudinal chat if you will, among our digital 668 01:22:06,000 --> 01:22:10,000 humanities scholars. So thanks to the advisory committee for 669 01:22:10,000 --> 01:22:15,000 helping us out with that. I'm gonna go a little off script. 670 01:22:15,000 --> 01:22:27,000 I don't know if Brewster is still on the call; I think that he is. But I would like to offer Brewster a chance, if he's still available and he's near his computer, to come on screen and 671 01:22:27,000 --> 01:22:31,000 give us a little wrap-up. You know, 672 01:22:31,000 --> 01:22:34,000 what do you think of everything that you've seen here today 673 01:22:34,000 --> 01:22:39,000 and across the series? Oh, this is so inspiring and so fun! 674 01:22:39,000 --> 01:22:45,000 It's just great to be able to see the light and humorous nature of this, 675 01:22:45,000 --> 01:22:48,000 as well as, you know, answering these real questions. 676 01:22:48,000 --> 01:22:55,000 And, you know, of course I spring to the collections; we have more of this to share with you. 677 01:22:55,000 --> 01:23:01,000 We have periodicals that go back to when it was called electronic data processing. 678 01:23:01,000 --> 01:23:10,000 Anyway, I'm so glad that this is going on, and I think the lightning format at least works for me. 679 01:23:10,000 --> 01:23:14,000 So thank you very much for pulling this together, Chris, and everyone. 680 01:23:14,000 --> 01:23:18,000 Yeah, thanks. I think with this lightning format it's nice to have the variety, right? 681 01:23:18,000 --> 01:23:30,000 You can do some deep dives on some individual topics and then get a broad overview of the wealth of materials and the wealth of research that's happening in and around the collections at the Internet 682 01:23:30,000 --> 01:23:36,000 Archive. What I would say as we wind down here today is that this is the start of a conversation. 683 01:23:36,000 --> 01:23:48,000 Let's think of this as the start of a conversation, not as the end of one. With this series, part of what we wanted to do in bringing this together was, one, to raise awareness that there are digital humanities projects that are actively using 684 01:23:48,000 --> 01:23:51,000 the collections and the infrastructure at the Internet Archive. 685 01:23:51,000 --> 01:23:55,000 And so what we were hoping was to bring some of those people together, 686 01:23:55,000 --> 01:24:01,000 provide some visibility into the work that everyone is doing, and start that conversation, so that we can do more together.
687 01:24:01,000 --> 01:24:21,000 So, to wind down here today, on behalf of the Internet Archive and our presenters today, I'd like to thank you all for your time and your participation, and a big thanks to everyone who's joined in the multiple sessions that we've